
Enabling the GPU

Start this notebook with a GPU attached:

  • Navigate to Edit→Notebook Settings
  • Select GPU from the Hardware Accelerator drop-down
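
To confirm that the runtime really has a GPU attached, a check along the following lines may help. This is a minimal sketch: it assumes the NVIDIA `nvidia-smi` tool is on the PATH whenever a GPU is attached (typically the case on Colab GPU runtimes), and the helper name `gpu_visible` is hypothetical, not part of the original notebook.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi exists and reports at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Assumption: no nvidia-smi on the PATH means no GPU runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # `nvidia-smi -L` prints one "GPU <n>: ..." line per device.
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU attached:", gpu_visible())
```

If this prints `False`, revisit the notebook settings before running the cells that follow.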

Requirements

After installing bertopic, click "Restart runtime" before continuing so that the newly installed package versions are picked up.

!pip install bertopic
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.5.0 requires numpy~=1.19.2, but you have numpy 1.21.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed bertopic-0.8.1 hdbscan-0.8.27 huggingface-hub-0.0.12 numpy-1.21.1 plotly-4.14.2 pynndescent-0.5.4 sacremoses-0.0.45 sentence-transformers-2.0.0 sentencepiece-0.1.96 tokenizers-0.10.3 transformers-4.8.2 umap-learn-0.5.1
!pip install numba --upgrade
Successfully installed llvmlite-0.36.0 numba-0.53.1
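
After restarting the runtime, it can be useful to confirm which versions are actually active. The sketch below uses the standard-library `importlib.metadata` module, which requires Python 3.8 or later; on Python 3.7 the `importlib_metadata` backport would be needed instead. The helper name `installed_versions` and the package list are illustrative choices, not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages touched by the two install cells above.
for name, version in installed_versions(
    ["bertopic", "numpy", "numba", "umap-learn", "hdbscan"]
).items():
    print(f"{name}: {version or 'not installed'}")
```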

Load data

Create a kaggle.json API token in your Kaggle account and upload it using the cell below.

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 62 bytes
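
Before authenticating, the uploaded token can be sanity-checked. A Kaggle API token file is a small JSON object with `username` and `key` fields; the sketch below only verifies that both are present and non-empty and does not contact Kaggle. The helper name `check_kaggle_token` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_token(path):
    """Return True if the file is JSON with non-empty 'username' and 'key'."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        # Missing or unreadable file, or content that is not valid JSON.
        return False
    return isinstance(data, dict) and all(
        isinstance(data.get(field), str) and data[field]
        for field in ("username", "key")
    )


token_path = Path.home() / ".kaggle" / "kaggle.json"
print("Token looks valid:", check_kaggle_token(token_path))
```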
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_download_file('pqbsbk/german-news-dataset',
                          file_name='data.csv')
True
!unzip data.csv.zip
Archive:  data.csv.zip
  inflating: data.csv                
!ls
data.csv  data.csv.zip	sample_data
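
Outside Colab, where the `!unzip` shell command may not be available, the same extraction can be done with Python's standard `zipfile` module. This is a sketch under that assumption; the helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Only runs if the download step above produced the archive.
if Path("data.csv.zip").exists():
    print(extract_archive("data.csv.zip"))
```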
import pandas as pd
df = pd.read_csv("data.csv")
# Keep only rows that have article text.
df = df[df["text"].notnull()]

Load trained model

from google.colab import drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive
!ls /content/gdrive/MyDrive
'Colab Notebooks'	   NER_01_030_2021_07_08.bin
 MC_01_03_29_06_2021.bin   ner.bin
 MC_02_06_07_02_21.bin	   tm_bert_topic
 MC_03_09_07_05_21.bin	   tm_bert_topic_2021_07_19__12_54_00
 multi_class.bin	   tm_bert_topic_2021_07_19__14_33_09
from bertopic import BERTopic
topic_model = BERTopic.load("/content/gdrive/MyDrive/tm_bert_topic_2021_07_19__14_33_09")

Model inference

topics, probs = topic_model.transform(list(df["text"])[:25000])

/usr/local/lib/python3.7/dist-packages/numba/np/ufunc/parallel.py:365: NumbaWarning: The TBB threading layer requires TBB version 2019.5 or later i.e., TBB_INTERFACE_VERSION >= 11005. Found TBB_INTERFACE_VERSION = 9107. The TBB threading layer is disabled.
  warnings.warn(problem)
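
Once `transform` has returned, a quick summary of the assignments shows how many documents ended up in each topic. The sketch below relies on BERTopic's convention of labelling outlier documents with topic `-1`; the helper name `summarize_topics` and the toy list are illustrative only. In the notebook, the `topics` list from the cell above would be passed instead.

```python
from collections import Counter


def summarize_topics(topics, top_k=10):
    """Return (outlier_share, most common non-outlier topics with counts)."""
    counts = Counter(topics)
    total = sum(counts.values())
    outlier_share = counts.get(-1, 0) / total if total else 0.0
    counts.pop(-1, None)  # drop the outlier bucket before ranking
    return outlier_share, counts.most_common(top_k)


# Toy assignments for illustration only; in the notebook, pass `topics`.
example_topics = [3, 3, -1, 7, 3, 7, -1, 12]
share, common = summarize_topics(example_topics)
print(f"Outliers: {share:.0%}")
print(common)
```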

Visualization

Visualize Topics

topic_model.visualize_topics()

Visualize Topic Hierarchy

topic_model.visualize_hierarchy(top_n_topics=50)

Visualize Terms

topic_model.visualize_barchart(top_n_topics=5)

Visualize Topic Similarity

topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

Visualize Term Score Decline

topic_model.visualize_term_rank()

Search Topics

similar_topics, similarity = topic_model.find_topics("iran", top_n=5)
similar_topics
[376, 370, 294, 112, 118]
topic_model.get_topic(370)
[('iran', 0.045999810020958744),
 ('atomabkommen', 0.021954321831109038),
 ('sicherheitsrat', 0.02192809918196278),
 ('sanktionen', 0.020971347590174532),
 ('iranische', 0.012191527878246075),
 ('atomabkommens', 0.011997523481121063),
 ('israel', 0.011565215339411601),
 ('abkommen', 0.010138687569399864),
 ('atomwaffen', 0.008668697396711857),
 ('embargos', 0.006550777015760913)]
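
To compare all of the returned topics at a glance rather than inspecting them one at a time, the IDs and similarity scores from `find_topics` can be combined with `get_topic`. This is a minimal sketch: the helper name `describe_topics` is hypothetical, and it assumes only that `get_topic` returns a list of `(term, score)` pairs, as in the output above.

```python
def describe_topics(model, topic_ids, scores, n_terms=5):
    """Build one summary line per topic: id, similarity, and top terms."""
    lines = []
    for topic_id, score in zip(topic_ids, scores):
        terms = [word for word, _ in model.get_topic(topic_id)[:n_terms]]
        lines.append(f"{topic_id} ({score:.2f}): {', '.join(terms)}")
    return lines


# In the notebook:
# for line in describe_topics(topic_model, similar_topics, similarity):
#     print(line)
```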